Table of contents

  1. Introduction
    1.1. Description of our data
    1.2. Getting the data
    1.3. Cleaning the data
  2. Graphical Analysis
    2.1 Map of earthquakes
  3. Linear Model 3.1. 3.2. 3.3.
  4. Generalised Linear Model Poisson
  5. Generalised Linear Model Binomial
  6. Generalised Additive Model
  7. Neural Network
  8. Support Vector Machine
  9. Optimisation Problem
  10. Conclusion


1. Introduction

Earthquakes are one of the most destructive natural disasters that can strike without warning, causing extensive damage to infrastructure, loss of life, and massive economic losses. While we cannot prevent earthquakes from occurring, the ability to accurately predict when and where they might occur could save countless lives and minimize the damage caused.

Therefore, our aim with this report is to contribute the significant earthquake prediction which enables to provide advanced warning of potentially catastrophic seismic events, allowing governments and communities to prepare and take necessary measures to minimize the impact of such events.


1.1.Description of our data

Data source: https://www.ngdc.noaa.gov/hazel/view/hazards/earthquake/search

The Significant Earthquake Database contains information on destructive earthquakes from 2150 B.C. to the present that meet at least one of the following criteria: Moderate damage (approximately $1 million or more), 10 or more deaths, Magnitude 7.5 or greater, Modified Mercalli Intensity X or greater, or the earthquake generated a tsunami. The database can also be displayed and extracted with the Natural Hazards Interactive Map.


Below we are listing a short summary of our main variables. At the primary and secondary deaths and damages where available the total numbers have been added to the dataset, in the “description” field the variables have already been clustered.


Valid values: 0.0 to 9.9

The value in this column contains the primary earthquake magnitude. Magnitude measures the energy released at the source of the earthquake. Magnitude is determined from measurements on seismographs. For pre-instrumental events, the magnitudes are derived from intensities. There are several different scales for measuring earthquake magnitudes. The primary magnitude is chosen from the available magnitude scales in this order:

Mw Magnitude
Ms Magnitude
Mb Magnitude
Ml Magnitude
Mfa Magnitude
Unknown Magnitude


Valid values: 1 to 12

The Modified Mercalli Intensity (Int) is given in Roman Numerals (converted to numbers in the digital database). An interpretation of the values is listed below.

Table 1. Modified Mercalli Intensity Scale of 1931

  1. Not felt except by a very few under especially favorable circumstances.

  2. Felt only by a few persons at rest, especially on upper floors of buildings. Delicately suspended objects may swing.

  3. Felt quite noticeably indoors, especially on upper floors of buildings, but many people do not recognize it as an earthquake. Standing motor cars may rock slightly. Vibration like passing truck. Duration estimated.

  4. During the day felt indoors by many, outdoors by few. At night some awakened. Dishes, windows, and doors disturbed; walls make creaking sound. Sensation like heavy truck striking building. Standing motorcars rock noticeably.

  5. Felt by nearly everyone; many awakened. Some dishes, windows, etc., broken; a few instances of cracked plaster; unstable objects overturned. Disturbance of trees, poles, and other tall objects sometimes noticed. Pendulum clocks may stop.

  6. Felt by all; many frightened and run outdoors. Some heavy furniture moved; a few instances of fallen plaster or damaged chimneys. Damage slight.

  7. Everybody runs outdoors. Damage negligible in buildings of good design and construction slight to moderate in well built ordinary structures; considerable in poorly built or badly designed structures. Some chimneys broken. Noticed by persons driving motor cars.

  8. Damage slight in specially designed structures; considerable in ordinary substantial buildings, with partial collapse; great in poorly built structures. Panel walls thrown out of frame structures. Fall of chimneys, factory stacks, columns, monuments, walls. Heavy furniture overturned. Sand and mud ejected in small amounts. Changes in well water. Persons driving motor cars disturbed.

  9. Damage considerable in specially designed structures; well-designed frame structures thrown out of plumb; great in substantial buildings, with partial collapse. Buildings shifted off foundations. Ground cracked conspicuously. Underground pipes broken.

  10. Some well-built wooden structures destroyed; most masonry and frame structures destroyed with foundations; ground badly cracked. Rails bent. Landslides considerable from river banks and steep slopes. Shifted sand and mud. Water splashed over banks.

  11. Few, if any (masonry), structures remain standing. Bridges destroyed. Broad fissures in ground. Underground pipelines completely out of service. Earth slumps and land slips in soft ground. Rails bent greatly.

  12. Damage total. Waves seen on ground surfaces. Lines of sight and level distorted. Objects thrown upward into the air.


The depth of the earthquake is given in kilometers.


Associated Tsunami or Seiche [Tsu]
When a tsunami or seiche was generated by an earthquake, An icon appears in the Associated Tsunami column which is linked to the tsunami event database. The link will display additional tsunami event information.

Volcano [Vol]
The Volcano link will display additional information if the earthquake was associated with a volcanic eruption. The information may include information such as the VEI index, morphology, and the effects of the eruption.


Description of Deaths from the Earthquake [Death.Description]

Valid values: 0 to 4

When a description was found in the historical literature instead of an actual number of deaths, this value was coded and listed in the Deaths column. If the actual number of deaths was listed, a descriptor was also added for search purposes.

0 None
1 Few (~1 to 50 deaths)
2 Some (~51 to 100 deaths)
3 Many (~101 to 1000 deaths)
4 Very many (over 1000 deaths)

Description of Damage from the Earthquake [Damage.Description]

Valid values: 0 to 4

For those events not offering a monetary evaluation of damage, the following five-level scale was used to classify damage (1990 dollars) and was listed in the Damage column. If the actual dollar amount of damage was listed, a descriptor was also added for search purposes.

0 NONE
1 LIMITED (roughly corresponding to less than $1 million)
2 MODERATE (~$1 to $5 million)
3 SEVERE (~$5 to $25 million)
4 EXTREME (~$25 million or more)

[Missing.Description]
[Injuries.Description]
[Houses.Destroyed.Description]
[Houses.Damaged.Description]


Description of Deaths from the Earthquake and secondary effects (eg Tsunami)
Valid values: 0 to 4

When a description was found in the historical literature instead of an actual number of deaths, this value was coded and listed in the Deaths column. If the actual number of deaths was listed, a descriptor was also added for search purposes.

0 None
1 Few (~1 to 50 deaths)
2 Some (~51 to 100 deaths)
3 Many (~101 to 1000 deaths)
4 Very many (over 1000 deaths)

[Total.Death.Description]
[Total.Missing.Description]
[Total.Injuries.Description]
[Total.Damage…Mil.]
[Total.Houses.Destroyed.Description]
[Total.Houses.Damaged.Description]


1.2.Getting the data

We first setting the working directory, than we are loading the tab separated data file to R.

setwd("C:/Users/A/Documents/FS23_ML1/Project Data/Git_repository/ML01")
eqdata <- read.csv("significant-earthquake-database.tsv", header = TRUE, sep = "\t")

At first we will have a look at the data provided:

str(eqdata)
## 'data.frame':    6365 obs. of  39 variables:
##  $ Search.Parameters                 : chr  "[]" "" "" "" ...
##  $ Year                              : int  NA -2150 -2000 -2000 -1610 -1566 -1450 -1365 -1250 -1050 ...
##  $ Mo                                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dy                                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hr                                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Mn                                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Sec                               : num  NA 0 NA NA NA 0 NA NA 0 0 ...
##  $ Tsu                               : int  NA NA 1 NA 3 NA NA 4 NA NA ...
##  $ Vol                               : int  NA NA NA NA 1351 NA NA NA NA NA ...
##  $ Location.Name                     : chr  "" "JORDAN:  BAB-A-DARAA,AL-KARAK" "SYRIA:  UGARIT" "TURKMENISTAN:  W" ...
##  $ Latitude                          : num  NA 31.1 35.7 38 36.4 ...
##  $ Longitude                         : num  NA 35.5 35.8 58.2 25.4 35.3 25.5 35.8 35.5 35 ...
##  $ Focal.Depth..km.                  : int  NA NA NA 18 NA NA NA NA NA NA ...
##  $ Mag                               : num  NA 7.3 NA 7.1 NA NA NA NA 6.5 6.2 ...
##  $ MMI.Int                           : int  NA NA 10 10 NA 10 10 NA NA NA ...
##  $ Deaths                            : int  NA NA NA 1 NA NA NA NA NA NA ...
##  $ Death.Description                 : int  NA NA 3 1 NA NA NA NA NA NA ...
##  $ Missing                           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Missing.Description               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Injuries                          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Injuries.Description              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Damage...Mil.                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Damage.Description                : int  NA 3 NA 1 NA 3 NA 3 3 3 ...
##  $ Houses.Destroyed                  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Houses.Destroyed.Description      : int  NA NA NA 1 NA NA NA NA NA NA ...
##  $ Houses.Damaged                    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Houses.Damaged.Description        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Deaths                      : int  NA NA NA 1 NA NA NA NA NA NA ...
##  $ Total.Death.Description           : int  NA NA 3 1 3 NA NA NA NA NA ...
##  $ Total.Missing                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Missing.Description         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Injuries                    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Injuries.Description        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Damage...Mil.               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Damage.Description          : int  NA NA NA 1 3 NA NA 3 NA NA ...
##  $ Total.Houses.Destroyed            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Houses.Destroyed.Description: int  NA NA NA 1 NA NA NA NA NA NA ...
##  $ Total.Houses.Damaged              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Houses.Damaged.Description  : int  NA NA NA NA NA NA NA NA NA NA ...

The original data set has 6365 observations of 39 variables.


1.3 Cleaning the data

After loading we have decided to perform the below cleaning our dataset:

Exclude data dated before 1800

  • Our reasons are the following:

    • data available mostly from historical records therefore less reliable

    • the measurement quality is not reliable based on less developed methods

    • we had a lot of missing values from these records

Exclude data where magnitude is not available

    • the variable of magnitude has key importance in our analysis, the records where it is missing we cannot trust.
library(dplyr)

eqdata <- eqdata %>% filter(Year >= 1800)
eqdata <- eqdata %>% filter(!is.na(Mag))
sum(is.na(eqdata$Mag))
## [1] 0


str(eqdata)
## 'data.frame':    4028 obs. of  39 variables:
##  $ Search.Parameters                 : chr  "" "" "" "" ...
##  $ Year                              : int  1800 1801 1802 1802 1803 1803 1804 1805 1806 1806 ...
##  $ Mo                                : int  NA 10 10 12 2 9 7 6 3 6 ...
##  $ Dy                                : int  NA 5 26 9 1 1 10 16 25 11 ...
##  $ Hr                                : int  NA NA 10 5 NA NA 13 8 NA NA ...
##  $ Mn                                : int  NA NA 55 NA NA NA NA 15 NA NA ...
##  $ Sec                               : num  NA NA 0 NA NA 0 NA 0 NA NA ...
##  $ Tsu                               : int  NA NA 5843 588 NA NA 593 NA NA NA ...
##  $ Vol                               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Location.Name                     : chr  "IRAN:  DAMAVAND" "MEXICO:  OAXACA" "ROMANIA:  CARPATHIAN FOLD,VRANCEA" "JAPAN:  NW HONSHU:  SADO ISLAND, OGI" ...
##  $ Latitude                          : num  36.2 16.2 45.7 37.7 25.6 ...
##  $ Longitude                         : num  53.3 -96.1 26.6 138.3 100.6 ...
##  $ Focal.Depth..km.                  : int  NA NA 150 NA NA NA NA 15 NA NA ...
##  $ Mag                               : num  6.5 7 7.4 6.6 6 7.5 7.3 6.1 7.5 7.7 ...
##  $ MMI.Int                           : int  NA NA 10 NA 8 NA NA 8 8 NA ...
##  $ Deaths                            : int  NA 7 NA 19 200 NA 450 200 2000 100 ...
##  $ Death.Description                 : int  NA 1 NA 1 3 3 3 3 4 2 ...
##  $ Missing                           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Missing.Description               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Injuries                          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Injuries.Description              : int  NA 2 NA NA NA NA NA 3 NA NA ...
##  $ Damage...Mil.                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Damage.Description                : int  3 3 2 3 2 3 4 3 4 2 ...
##  $ Houses.Destroyed                  : int  NA NA NA 1060 NA NA 10810 NA NA NA ...
##  $ Houses.Destroyed.Description      : int  NA NA NA 4 2 3 4 3 3 3 ...
##  $ Houses.Damaged                    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Houses.Damaged.Description        : int  NA NA 2 NA NA 3 NA 3 NA NA ...
##  $ Total.Deaths                      : int  NA 7 NA 19 NA NA 450 200 2000 100 ...
##  $ Total.Death.Description           : int  NA 1 NA 1 NA 3 3 3 4 2 ...
##  $ Total.Missing                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Missing.Description         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Injuries                    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Injuries.Description        : int  NA 2 NA NA NA NA NA 3 NA NA ...
##  $ Total.Damage...Mil.               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Total.Damage.Description          : int  NA 3 2 3 2 3 4 3 4 2 ...
##  $ Total.Houses.Destroyed            : int  NA NA NA 1060 NA NA 10810 NA NA NA ...
##  $ Total.Houses.Destroyed.Description: int  NA NA NA 4 2 3 4 3 3 3 ...
##  $ Total.Houses.Damaged              : int  NA NA NA NA NA NA 300 NA NA NA ...
##  $ Total.Houses.Damaged.Description  : int  NA NA 2 NA NA 3 3 3 NA NA ...

The original data set has 6365 observations of 39 variables.


After cleaning the dataset we have 4028 observations of 39 variables remaining. We will perform all the following analysis with this dataset.


2. Graphical Analysis


2.1 Map of earthquakes

library("rnaturalearth")
library("rnaturalearthdata")
library("sf")
library("ggplot2")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
## [1] "sf"         "data.frame"


3. Linear Model


3.1 Characteristics of a linear model

Linear regression models are unsupervised models, which means we want to predict how the dependent variable changes with changing the independent variables. Regression models the dependent variable takes quantitative measures.(this means Y value is an increasing or decreasing number whith changing the value of the dependent variables.)

Generally linear models are never completely correct, but the interpretability of the linear model is relatively high compared to other more complex models. The danger of overfitting is generally less with linear models.

Linear regression can be performed with fitting a single predictor or a multiple predictors variables at the same time. Generally one starts with fitting all the relevant predictors.This is the reason, we start our statistical analysis with a multiple linear regression below.


3.2 Our research questions

The scope of this analysis is to be able to make predictions about number of deaths caused by earthquakes. This is to understand how magnitude, intensity, focal depth,time and location correspond to the total number of deaths by earthquakes.

We are going to look at which independent variables have an effect on the:

  • Y1: total number of deaths (with secondary effects):
    [Total.Damage…Mil.]
  • Y2: total cost of damage (USD) (with secondary effects):
    [Total.Deaths]

Remark: We are going to use the backward evaluation model, which means we are fitting all relevant the independent variables available in our data set, and removing them one by one looking at how the significance changes.

In both cases we will examine the below independent variables:

  • magnitude [Mag]

  • intensity [MMI.Int]

  • focal depth [Focal.Depth..km]

  • Year [Year]: Remark: month, date, day hour and sec we do not include, given we do not see an added value here.

  • Location [Latitude],[Longitude]

  • Volcano, Tsunami: [Vol],[Tsu]

  • Location name: We do not include this either, given it would lead to too many levels, which is problematic using linear regression. We have the Lattitude and longitude, included which has an indication on location.


3.3 Data Cleaning and fitting linear model

Total Death numbers, total damage costs, intensity and focal depth are not available in each row of our data set. Given we cannot predict these values we have decided to remove the rows, where any of these values is NA. We are also turning NA`s in Tsu and Vol column to 0 in order we can fit the linear model below.

  • Our final data set eqdata.no.na.total.death has 678 rows.


## 
## Call:
## lm(formula = Total.Deaths ~ Mag + MMI.Int + Focal.Depth..km. + 
##     Year + Vol + Tsu + Latitude + Longitude, data = eqdata.no.na.total.death)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13000  -4155  -1935    502 232189 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.401e+04  3.495e+04  -0.687 0.492247    
## Mag               2.252e+03  8.069e+02   2.791 0.005396 ** 
## MMI.Int           1.353e+03  4.056e+02   3.337 0.000895 ***
## Focal.Depth..km. -1.636e+01  1.826e+01  -0.896 0.370614    
## Year              4.270e-01  1.657e+01   0.026 0.979446    
## Vol              -2.191e-01  1.509e+00  -0.145 0.884613    
## Tsu              -1.835e-01  4.105e-01  -0.447 0.655003    
## Latitude          3.187e+01  2.792e+01   1.142 0.254018    
## Longitude         9.191e+00  7.151e+00   1.285 0.199144    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14730 on 669 degrees of freedom
## Multiple R-squared:  0.0638, Adjusted R-squared:  0.0526 
## F-statistic: 5.699 on 8 and 669 DF,  p-value: 4.992e-07


Above we can see the Adjusted R2 value is 0.0526, which means that our model explains only 6 % the relationship of the variables. This is a very low result.

The individual variables except 2 of them are larger than 0.005 which indicates, that they do not seem to have a meaningful effect on our dependent variable.

In the code below, we will be removing them one by one, and will not comment those results where the results do not change meaningfully.

As a result we end up keeping the Magnitude and Intensity which are the 2 independent variables which seem to be relevant for our model.


## 
## Call:
## lm(formula = Total.Deaths ~ Mag + MMI.Int, data = eqdata.no.na.total.death)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11671  -4284  -1815    636 233936 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20826.0     4245.5  -4.905 1.17e-06 ***
## Mag           1796.1      695.0   2.584  0.00996 ** 
## MMI.Int       1471.6      359.7   4.092 4.80e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14720 on 675 degrees of freedom
## Multiple R-squared:  0.05645,    Adjusted R-squared:  0.05366 
## F-statistic: 20.19 on 2 and 675 DF,  p-value: 3.04e-09


Colinearity

Now we are looking at the correlation matrix, to check whether there is a collinearity between Magnitude and Intensity.



As we can see the correlation coefficients are not too high, so there is no colinearity between Magnitife and Intensity.

An other indicator to check this is the VIF value: Which is not high either.


##   Variables Tolerance      VIF
## 1       Mag  0.808905 1.236239
## 2   MMI.Int  0.808905 1.236239


Finally, let us check an interpret the coefficients of the linear model.

The intercept is below zero, which is very hard to interpret for our variable. This may not even make sens.

The coefficient for magnitude tells us, that with each unit increase of a magnitude the number of deaths increases with 1796.

The coefficient of the intensity, tells us, with unchanged value of the intercept and the magnitude, if the intensity is one unit higher, it is likely that the number of deaths increases additionnaly with 1471.


3.4 Graphical analysis Magnitude and Intensity

Let`s look at in 2 dimension, how our data plots Magnitude and Intensity plotted against the number of total deaths looks like:



The blue line ia a regression line. We can see in both cases that our linear regresion line on both cases does not really fits to the data points especially at higher values of magnitude and Intensity.


If we plot in 3D our data points, as well as the plaine, which visualizes our linear model, looking at the data we have the below assumptions:

  1. There is non-linearity in our model
  2. Data is not normally distributed
  3. There is heteroskedasticity
  4. The data contains outliers and high leverage points




To investigate on our assumptions, we plot the residual against the fitted values.

Figure below for Magnitude vs. Total Death:


Figure below for Intensity vs. Total Death:


In both figures we can see 4 plots which we explain below:

Residuals vs Fitted Values

The red line is linear, however towards higher values majority of our data points start to be divide from the line. This is the place where we also have our outliers.This may indicate some non-lineriality in our data.

Normal Q-Q

This chart we can see how the distribution of the data is. In case the most data points lie in the line, the data is normally distributed. In our case towards higher values the majority of our data points divergates from the line, this means that in our model the residuals aren’t Gaussian and thus your errors aren’t either, so our data is not normally distributed.

Scale Location

This chart we can see if there is heteroskedasticity in our data. On both charts, the red line is approximately horizontal, this means that the average magnitude of the standardized residuals isn’t changing much as a function of the fitted values. However the distribution of the datapoints concentrates around the value 5000 and in lower and higher values we have significantly less datapoints. This indicates heteroskedasticity.

Residuals vs Leverage

In the plotted data we can already see that there are some data points “far” away, or outlying. This chart shows which outliers are high leverage points, which means their change or removal influences more our model as the other datapoints. At Magnitude as well as at Intensity we have few such a points which lie outside of the line indicating the Cook`s distance.


To sum up our findings above, we can conclude, that our linear model fitted with Intensity and Magnitude is not the best fit for predict the number of deaths caused by earthquakes. Therefoer below we are going to look at more complex models.


4. Generalised Linear Model Poisson


5. Generalised Linear Model Binomial


6. Generalised Additive Model

Difference GLM vs. GAM GAM do not assume a priori any specific form of this relationship, and can be used to reveal and estimate non-linear effects of the covariate on the dependent variable.

7. Neural Network


8. Support Vector Machine


9. Optimisation Problem


10. Conclusion